Complexity-calibrated benchmarks for machine learning reveal when prediction algorithms succeed and mislead

Recurrent neural networks are used to forecast time series in finance, climate, language, and from many other domains. Reservoir computers are a particularly easily trainable form of recurrent neural network. Recently, a “next-generation” reservoir computer was introduced in which the memory trace involves only a finite number of previous symbols. We explore the inherent limitations of finite-past memory traces in this intriguing proposal. A lower bound from Fano’s inequality shows that, on highly non-Markovian processes generated by large probabilistic state machines, next-generation reservoir computers with reasonably long memory traces have an error probability that is at least \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sim 60\%$$\end{document}∼60% higher than the minimal attainable error probability in predicting the next observation. More generally, it appears that popular recurrent neural networks fall far short of optimally predicting such complex processes. These results highlight the need for a new generation of optimized recurrent neural network architectures. Alongside this finding, we present concentration-of-measure results for randomly-generated but complex processes. One conclusion is that large probabilistic state machines—specifically, large \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\epsilon$$\end{document}ϵ-machines—are key to generating challenging and structurally-unbiased stimuli for ground-truthing recurrent neural network architectures.


Background
Section "Complex processes and ǫ-machines" describes ǫ-machines and Section "Recurrent neural networks" lays out the setup of the typical recurrent neural network (RNN) and reservoir computer (RC).

Complex processes and ǫ-machines
Each stationary stochastic process is uniquely represented by a predictive model called an ǫ-machine.This one- to-one association is particularly noteworthy as it gives explicit structure to the space of all such processes.One can either explore the space of stationary processes or, equivalently, the space of all ǫ-machines.This is made all the more operational, since ǫ-machines can be efficiently enumerated 13 .
In information theory they are viewed as process generators and described as minimal unifilar hidden Markov chains (HMC).In computation theory they are viewed as process recognizers and described as minimal probabilistic deterministic automata (PDA) 14,15 .Briefly, an ǫ-machine has hidden states σ ∈ S , referred to as causal states, and generates a process by emitting symbols x ∈ A over a sequence of state-to-state transitions.For purposes of neural-network comparison in the following, we explore binary-valued processes, so that A = {0, 1} .ǫ-Machines are unifilar or "probabilistic deterministic" models since each transition probability p(σ ′ |x, σ ) from state σ to state σ ′ given emitted symbol x are singly supported.More simply, there is at most a single destination state.In computation theory this is a deterministic transition in the sense that the model reads in symbols which uniquely determine the successor state.That said, these models are probabilistic as process generators: given that one is in state σ , a number of symbols x can be emitted, each with emission probability p(x|σ ) .In this way, these models represent stochastic languages-a set of output strings each occurring with some probability.
While every stationary process has an ǫ-machine presentation, it is usually not finite.An example is shown in Fig. 1 16 .The finite HMC on the top is nonunifilar since starting in state A and emitting a 0 does not uniquely determine to which state one transits-either A or B. The HMC on the bottom is unifilar, since in every state, knowing the emitted symbol uniquely determines the next state.Note that the ǫ-machine for the process generated by the finite nonunifilar HMC has an infinite number of causal states.Also, note that the process has infinite Markov order: if one sees a past of all 0s, one has not "synchronized" to the ǫ-machine's internal hidden state 17 , meaning that one does not know which hidden state of the ǫ-machine one is in.And, therefore, there is not a complete one-to-one correspondence between sequences of observed symbols and chains of hidden states.In contrast, with each step in the ǫ-machine presentation one inches closer to a one-to-one correspondence between observed symbols and hidden states-in reality, as close as possible.
In a way, a nonunifilar HMC is little more than a process generator 18 for which the equivalent ǫ-machine presentation has an infinite number of causal states.In another sense, ǫ-machines are a very special type of HMC generator since the ǫ-machine's causal states actually represent clusters of pasts that have the same conditional probability distribution over futures 14 .As a result, the causal states and so ǫ-machines are predictive.
Consider observing a process generated by a particular ǫ-machine and becoming synchronized so that you know the hidden state.(Now, this happens with probability 1 but it does not always happen 17 , as we just described with the nonunifilar HMC example.)Then you can build a prediction algorithm based on the known hidden state.The result, in fact, is the best possible prediction algorithm that one can build.Moreover, the latter is simple: when synchronized to hidden state σ , you predict the symbol arg max x p(x|σ ).
This has one key consequence in our calibrating neural networks: the minimal attainable time-averaged probability P min e of error in predicting the next symbol can be explicitly calculated as: www.nature.com/scientificreports/(The following considers binary alphabets, so that 1 − max x p(x|σ ) = min x p(x|σ ) .)We are also able to calculate the Shannon entropy rate h µ directly from the ǫ-machine 14 via: (1) At top, we see a nonunifilar hidden Markov model that is not an ǫ-Machine because, when in state A, knowing that you have emitted a 0 does not uniquely determine to which state one has transitioned.At bottom, we see the corresponding ǫ-Machine, for which in every state, knowing the emitted symbol uniquely determines the next state.For this ǫ-Machine, we have 16 .Note that both hidden Markov models generate an identical infinite-order Markov process: if one sees a past of all 0's, one has not "synchronized" to the internal hidden state of the ǫ-Machine.Therefore, there is not a complete one-to-one correspondence between sequences of observed symbols and hidden states.In contrast, until recently determining h µ for processes generated by nonunifilar HMCs was intractable.The key advance is that for these processes we recently solved Blackwell's integral equation 19,20 .

Recurrent neural networks
Let s t ∈ R d be the state of the learning system-perhaps a sensor-and let x t ∈ R N be a time-varying N-dimensional input, both at time t.Discrete-time recurrent neural networks (RNN) are input-driven dynamical systems of the form: where f θ (•, •) is a function of both sensor state s t and input x t with parameters θ .See Fig. 2.These parameters are weights that govern how s t and x t affect future sensor states s t+1 , s t+2 , . . . .Alternative RNN architectures result from different choices of f.See below.Long Short-Term Memory Units (LSTMs) and Gated Recurrent Units (GRUs) are often used to optimize prediction of input.For simplicity, the following considers scalar time series: N = 1.Generally, RNNs are hard to train, both in terms of required data sample size and compute resources (memory and time) 21 .RCs 2,22 , also known as echo state networks 23 and liquid state machines 24,25 , were introduced to address these challenges.
RCs involve two components.The first is a reservoir-an input-dependent dynamical system with high dimensionality d as in Eq. (3).And, the second is a readout layer û -a simple function of the hidden reservoir state.Here, the readout layer typically employs logistic regression: with regression parameters a û and b û .To model binary-valued processes, our focus here, we have: The regression parameters are easily trained and can include regularization if desired.Note that while s was used as the input into the logistic regression probabilities, one can move to nonlinear readout by also using ss ⊤ to inform the logistic regression probabilities.

'Typical' RCs
In the following, as a model of typical RCs, a subset of RC nodes are updated linearly, while others are updated according to a tanh(•) activation function.Let s = s nl s l .We have: and where v l,nl controls how strongly the input affects the state, b l,nl is a bias term, and W l,nl are the weight matrices.
The weight matrices W l,nl are chosen to have: 0 entries based on a small-world network of density 0.1 and β = 0.1 ; nonzero entries normally distributed according to the standard normal; and a spectral radius ∼ 0.99 to guarantee the RC fading-memory condition 23 .Different recipes for choosing which nodes were connected (2) . www.nature.com/scientificreports/(small-world networks with varying β and density), what distribution the weights were drawn from (normal versus uniform), and whether or not there was a bias term were tried.These variations had virtually no effect on the results.The only thing that clearly made a difference was whether or not the readout was linear (logistic regression) or nonlinear (logistic regression on a concatenation of s and ss ⊤ ).These two cases are clearly demarcated in the appropriate figure.There was no bias term in the figure shown.

'Next generation' RCs
Next-generation RCs employ a simple reservoir that tracks some amount of input history and a more complex readout layer 5 to improve accuracy over RC's universal approximation property.The reservoir records inputs from the last m timesteps and, then, uses a readout layer consisting of polynomials of arbitrary order.Technically, nextgeneration RCs are a subset of general RCs in that a reservoir can be made into a shift register that records the last m timesteps.As introduced in Ref. 5 next-generation RCs solve a regression task, but they can easily be modified to solve classification tasks.The following simply takes second-order polynomial combinations of the last m timesteps and uses those as features for the logistic regression layer.In other words, let s t = (x t , x t−1 , ..., x t−m+1 ) , a column vector, be the state of the reservoir; then we use s t and s ⊤ t s t as input to the logistic regression.

LSTMs
In contrast, long short-term memory networks (LSTM) 9 take a different approach by optimizing f θ for training and for retaining memory.There, s is a combination of several hidden states and the update equations for the network are given in Ref. 9 .An LSTM's essential components consist of linearly-updated memory cells that make training easier and avoid exploding or vanishing gradients and a forget gate that may improve performance by allowing the network to access a range of timescales 4 .

Prediction error bounds
No matter the RNN, the conditional entropy of the next input symbol X t given the learning system's state S t , places a fundamental upper bound on the RNN prediction performance through Fano's inequality: In this, P e is the time-averaged probability of making an error in predicting the next symbol x t from RNN's state s t , and H b is the binary entropy function.We have also invoked stationarity of the time series, to remove the dependence on t in the steady-state operation of the RNN.In particular, for a binary process where |A| = 2: where H −1 b , defined on the domain [0, 1], is the inverse of H b on its monotonically increasing domain [0, 1/2].In other words, the measure of RNN performance is given by a function of H[X 0 |S 0 ] that lower bounds P e , coupled with the minimal attainable probability of error calculable directly from the ǫ-machine as described in Section "Background".The lower the model's conditional entropy, the better prediction performance.For any RNN, due to the Markov chain S 0 → ← − X 0 → X 0 , this cannot be lower than h µ -the entropy rate: Notably, the next-generation RC takes into account only the last m timesteps, so that: where the myopic entropy rate h µ (m) ≥ h µ is discussed at length in Ref. 26 .

Results
We are now ready to calibrate RNN and RC performance on the task of time-series prediction.First, we survey the performance of RCs when predicting a random sample of typical complex stochastic processes.Second, we explore RC performance on an "interesting" complex process-one from the family of memoryful renewal processes-hidden semi-Markov processes with infinite Markov order.Third and finally, we compare the prediction performance of RCs, next-generation RCs, and LSTM RNNs on a large suite of complex stochastic processes.

Limits of next-generation RCs predicting "typical" processes
We construct exemplars of "typical" complex processes by sampling the space of ǫ-machines as follows: • An arbitrary large number of candidate states is chosen for the HMC stochastic process generator.This parallels the fact that most processes have an infinite number of causal states 15,27 ; . www.nature.com/scientificreports/ • For each ( σ , x) pair, a labeled transition σ x − → σ ′ is randomly generated, with the destination state σ ′ chosen from the uniform distribution over candidate states; • Symbol emission probabilities p(x|σ ) are randomly generated from a Dirichlet distribution with uniform concentration parameter α; • We retain the largest recurrent component of this construction as our sample ǫ-machine.
Numerically, we find that approximately 20% of the candidate states become transients in the constructed directed network, which are then trimmed from the final ǫ-machine.This number of transients strongly clusters around 20% as the number of candidates grows large.(Note that this is a topological feature, independent of α .)Moreover, this candidate network typically has a single recurrent component.Accordingly, the resulting causal states typically number about 80% of the candidate states in our construction, as the number of candidate states grows large.
This results in a finite-state unifilar HMC or, equivalently, a presentation that can generate a process with a finite number of causal states.Interestingly, though, the process generated is usually infinite-order Markov 28 .This can be seen from the mixed-state presentation that describes optimal prediction 20,26 , whose transient states of uncertainty generically maintain nonzero probability even after arbitrarily long observation time.[This is typical even when the mixed-state presentation has a finite number of transient states.Adding a further challenge to the task of prediction, though, the mixed-state presentation typically has infinitely many transient states.] An expression for the myopic entropy rate h µ (m) was developed in Ref. 26 that allows one to exactly compute h µ (m) from the generating ǫ-machine's mixed-state presentation.However, for binary-valued processes it was more straightforward to explicitly enumerate possible length-m futures.Note, though, that this is impractical for the trajectory lengths used here if the emitted-symbol alphabet is larger than two.Figure 3(top) shows h µ (m) as a function of m, in the case that α = 1 .Figure 3(bottom) shows percentage increases in the P e lower bounds www.nature.com/scientificreports/for next-generation RCs above and beyond the minimal P min e , tracking prediction error lower bounds given by Fano's inequality in Section "Results".
Across this family of stochastic processes, typical values of the myopic entropy rate h µ (m) and the entropy rate h µ exhibit a concentration of measure as the number of causal states grows large, with values clustering around 1/2 nat (not shown here).Typical values of the percentage increase in the P e above and beyond the minimal P min e show a concentration of measure, and the minimum probability P min e of error cluster around 1/4 (not shown here), reminiscent of the process-survey results reported by Ref. 29 .
A quick plausibility argument suggests that there is a genuine concentration of measure for these two quantities, using the formulae in Section "Background".Roughly speaking, when the ǫ-machine generator has a large number of causal states, the transitions from any particular state have little effect on the stationary state distribution p(σ ) .Hence, h µ and P min e are roughly the sum of N i.i.d.random variables.The Central Theorem dictates for the concentration parameter α = 1 that h µ estimates should cluster around 1/2 nat and that P min e should cluster around 1/4.In contrast, H[X 0 ] has the larger expected value of ln(2) nats, which becomes typical as the number of causal states grows large.The gradual decay of uncertainty from ln(2) to 1/2 nat per symbol can only be achieved by predictors that (at least implicitly) synchronize to the latent state of the source via distinguishing long histories.
These typical processes are surprisingly non-Markovian, exhibiting infinite-range correlation.A process' degree of non-Markovianity is reflected in how long it takes for h µ (m) to converge to h µ : how large must m be to synchronize?Even after observing m = 15 symbols, these processes (with a finite but large number of causal states) are still ∼ 0.2 nats away from synchronization.This convergence failure contributes to a minimal probability of error that cannot be circumvented no matter the cleverness in choosing the RC nonlinear readout function.

Limits of next-generation RCs predicting an "interesting" process
References 11,12 define complex and thus "interesting" processes as those that have infinite mutual information between past and future-the so-called "predictive information" or "excess entropy".The timescales of predictability are revealed through the growth I pred (m) as longer length-m blocks of history and future are taken into account.The predictive information is: And so, its growth rate is: That is: The gap between h µ (m + 1) and h µ quantifies the excess uncertainty in the next observable, due to observation of only a finite-length past.This is governed by I pred (m + 1) − I pred (m) in discrete-time processes or, analogously, by dI pred (t)/dt in continuous-time processes.
What constitutes an acceptable increase in prediction error above and beyond h µ ?The intuition for this follows from inverting Fano's inequality to determine the additional conditional entropy implied by a substantial increase in the probability of error.
To illustrate this, we turn to an interesting process that has a very slow gain in predictive information-the discrete-time renewal process shown in Fig. 1(Bottom), with survival function: Discrete-and continuous-time renewal processes are encountered broadly-in the physical, chemical, biological, and social sciences and in engineering-as sequences of discrete events consisting of an event type and an event duration or magnitude.An example critical to infrastructure design occurs in the geophysics of crustal plate tectonics, where the event types are major earthquakes tagged with duration time, time between their occurrence, and an approximate or continuous Richter magnitude 30 .Another example is seen in the history of reversals of the earth's geomagnetic field 31 .In physical chemistry they appear in single-molecule spectroscopy which reveals molecular dynamics as hops between conformational states that persist for randomly distributed durations 32,33 .A familiar example from neuroscience is found in the spike trains generated by neurons that consist of spikeno-spike event types separated by interspike intervals 34 .Finally, a growing set of renewal processes appear in the quantitative social sciences, in which human communication events and their durations are monitored as signals of emergent coordination or competition 35 .
At β = 1 , this discrete-time renewal process has I pred (m) ∼ log log m 36 .The minimal achievable lower bound is P min e = 0.0001 .Due to an additional ∼ 0.1 nats from not using an infinite-order memory trace and instead only using the last m = 5 symbols, the probability-of-error lower bound jumps to 0.02.This is a percentage Vol:.( 1234567890) www.nature.com/scientificreports/increase in probability of error of 10 4 % at m = 11 timesteps-about two and half orders of magnitude worse than that of a typical complex process.We emphasize these are fundamental bounds that no amount of cleverness can circumvent.While any nonlinear readout function might be chosen for a next-generation RC, the process' inherent complexity demands that an infinite-order memory trace be used for relatively good prediction.References 37,38 constructed an HMC that ergodically 39 generated I pred (m) ∼ log m .For this process: Consider a process that has a "typical" entropy rate of 0.5 nats, we can invert Fano's inequality-that is not necessarily tight-to find a lower bound on the probability of error with an infinite memory trace.Assuming this lower bound, the bound on the percentage increase of the probability of error above and beyond P min e decays to 10% only when the RC uses more than m = 1000 symbols.See Fig. 4(bottom).

RCs, next-generation RCs, and state-of-the-art RNNs predicting highly non-markovian processes
Knowing that there are fundamental limits to the next-generation RC's ability to predict processes forces the question: how well do next-generation RCs actually do at predicting these processes when using second-order polynomial readout?Moreover, do more RCs and state-of-the-art RNNs do any better?
In all experiments, we are careful to hold the number of input nodes to the readout constant for a fair comparison.
We now compare typical RCs with linear readout, typical RCs with nonlinear readout (second-order polynomial), and LSTMs to next-generation RCs on prediction tasks generated by the large ǫ-machines of Section "Limits of next-generation RCs predicting "typical" processes".Although RCs with nonlinear readout and many more nodes outperform next-generation RCs, Fig. 5 shows that when the number of readout nodes is held constant, next-generation RCs are indeed the best RC possible.This is expected from Ref. 5 .LSTMs h µ (m + 1) ≈ h µ + 1 m .(Bottom) Percentage increase in the lower bound for the probability of error P e above and beyond the minimum using Fano's inequality as a function of time steps m for a process such as that in Ref. 36 .beat all reservoir computers, however, as one can see from the red violin plot of Fig. 5 settling primarily on the lowest possible values of (P e − P min e )/P min e × 100% .This is somewhat expected since LSTMs optimize both the reservoir and readout, although the fact that they do is a testament to the success of backpropagation through time (BPTT) 40 .
Figure 5's surprise is that all RNNs perform quite poorly, leaving at least ∼ 50% increase in the probability of error above and beyond optimal, as one can see from the surprisingly large values on the y-axis, achieved at m = 10 for the next-generation RC.This nearly saturates the lower bound on this percentage increase in the probability of error placed by Fano's inequality.

Conclusion
The striking advances made by RNNs in predicting a very wide range of systems-from language to climate-have not been accompanied by markedly improved explorations of how much structure they fail to predict.Here, we introduced and illustrated such a calibration.
We addressed the task of leveraging past inputs to forecast future inputs, for any stochastic process.We showed that P min e minimal time-averaged probability of incorrectly guessing the next input, minimized over all possible strategies that can operate on historical input-can be directly calculated from a data source's generating ǫ-machine.This provides a benchmark for all possible prediction algorithms.We compared this optimal predictive performance with a lower bound on various RNNs' P e -the actual time-averaged probability of incorrectly guessing the next input, given the state of the model.We found that so-called next-generation RCs are fundamentally limited in their performance.And we showed that this cannot be improved on via clever readout nonlinearities.
In our comparison of various prediction models, we tested next-generation RCs with highly-correlated inputs that are challenging to predict.This input data was generated from large ǫ-machines.The ǫ-machines are the optimal prediction algorithm, and the minimal probability of error for these data are known in closed-form.Our extensive surveys showed, surprisingly, that models from RCs with linear readout to next-generation RCs of reasonable size to LSTMs all have a probability of prediction error that is ∼ 50% greater than the theoretical minimal probability of error.
The fact that simple large random ǫ-machines generate such challenging stimuli might be a surprise.Recently, though, it was reported that tractable ǫ-machines can lead to "interesting" processes 11,12 .We showed that these processes provide even more of a challenge for next-generation RCs.
At first, it may seem that this new calibration is somewhat useless, both theoretically and from a practical point of view.For instance, it is perhaps not surprising that RCs, NGRCs, and maybe even LSTMs perform poorly on highly non-Markovian processes such as the ones used here.However, with N nodes, one can find N predictive features that potentially reach far back into the past even though one might naively think that the N features correspond to the last N time points.Secondly, the processes used here are not of general interest, as large random ǫ-machines do not correspond to real-world signals in structure.However, one can manufacture ǫ-machines that do have the structure of real-world signals, as any real-world signal can be represented by an ǫ-machine.Then, the calibration here can improve the RC or RNN's ability to predict real-world signals.This potential research program extends even to nonstationary real-world data.Both natural language and natural data from the physical world can be understood as stochastic processes which, in principle, have some ǫ-machine representation.While we focused on stationary processes in this manuscript, nonstationary processes can be for 100 ǫ-machines with 300 candidate states.The next-generation RC has 10 timesteps as input; the typical RC with nonlinear readout has 10 nodes with 5 linear nodes; the typical RC with linear readout has 110 nodes with 10 linear nodes; and the LSTM has 110 nodes.The number of nodes has been chosen so that the number of readout nodes is equivalent across machines.Note that these nearly saturate the lower bound provided by Fano's inequality. https://doi.org/10.1038/s41598-024-58814-0www.nature.com/scientificreports/

Figure 2 .
Figure 2. A recurrent neural network (RNN) for which the future state of the recurrent node depends on its previous state and the current input.The present state of the recurrent node is then used to make a prediction. https://doi.org/10.1038/s41598-024-58814-0

Figure 3 .
Figure 3. (Top)Finite-length entropy rate h µ (m) in nats for typical random unifilar HMCs constructed with 30 (blue), 300 (orange), and 3000 (green) candidate states as a function of the number of input timesteps m. (Bottom) Increase of the lower bound on the probability P e of error from Fano's inequality, above and beyond P min e , with the same random unifilar HMCs as a function of the number of input timesteps m.Since occassionally the lower bound on this quantity fell below 0% , the maximum of 0% and the quantity is used.90% confidence intervals are shown on both graphs.

Figure 4 .
Figure 4.Predicting a discrete-time fractal renewal process with infinite excess entropy: (Top) Percentage increase in the lower bound for the probability of error P e above and beyond the minimum using Fano's inequality as a function of time steps m. (Bottom) Percentage increase in the lower bound for the probability of error P e above and beyond the minimum using Fano's inequality as a function of time steps m for a process such as that in Ref.36 .

Figure 5 .
Figure 5. Percentage increase in the probability of error of trained next generation RCs (green), trained RCs with linear readout (orange), trained RCs with nonlinear readout (blue), and trained LSTMs (red) above and beyond P min e